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Abstract 

The article deals with a kind of recursive function templates in C++, where the re- 
cursion is realized corresponding template parameters to achieve better computational 
performance. Some specialization of these template functions ends the recursion and 
can be implemented using optimized hardware dependent or independent routines. The 
method is applied in addition to the known expression templates technique to solve linear 
algebra expressions with the help of the BLAS library. The whole implementation pro- 
duces a new library, which keeps object-oriented benefits and has a higher computational 
speed represented in the tests. 

Introduction 

The C++ programming language has a powerful template facility that enables the de- 
velopment of flexible software without incurring a large abstraction penalty n,|2I- The 
main goal is the resolving of all templates of a program during the compilation time. In 
this way C++ language can be meant as a two-level language 5 . A function template 
takes both template parameters (solved in the compilation time) and function arguments 
(work dynamically in the program code). 

A naive implementation of linear algebra operations in C++ using the known object- 
oriented features, such as providing of classes and operator overloading, yields an ineffi- 
cient code. The main reason for that is generation of temporary objects by each over- 
loaded operator (P). This problem is usually solved by expression templates technique 
[^,[ni, which is implemented in the known optimized C++ libraries for linear algebra, e.g. 
valarray standard library (Linux RedHat 7.2), Blitz++ etc. Their tests and descriptions 
[H| prove that computational performance of such libraries can be equal to ones written 
in FORTRAN without any object-oriented limitations. 

But there are some processor and cash oriented implementations, which have better 
performance. The best example of such linear algebra library is Basic Linear Algebra 
Subprograms They exist for most of the hardware platforms with the same inter- 
face, although specifically implemented (in FORTRAN or Assembler) for some types of 
commonly used processors. 
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Therefore the combination of an expression templates abstraction with the incorpo- 
ration of BLAS in a speciahzation can produce a better performance than the best pure 
C-|--|- hbraries. 

Expression templates 

We consider a simple example of a vector expression ^ to see, how the expression tem- 
plates are used to collect the arguments and the operators of the vector expression. 

X = A + B + C. (2) 

As assumed in C-|— 1-, the right hand side of the equation ^ is resolved from left to 
right before the assign operator is applied. So, we have the series of operators: 

1. X = A + B + C . The first operator '-|-' is applied. 

2. X = BinClos < A,B,+ > +C. The second operator '-|-' is applied. 

2>. X = BinClos < BinClos < A,B,+ >,C,+ >. The assign operator '=' is applied. 

We have just introduced a new symbolic notation BinClos for a binary operation. It looks 
intentionally like a template class, where A, B and must be data types as template 
parameters. The basic trick of the approach is to substitute a template class as a template 
parameter into itself and to build parse trees using operator overloading |^ . 

The C-|— |- class for description of binary expression closure is defined using three 
template parameters: the right and the left hand side and the operation. But it has also 
to save references (not the values) to the arguments of the binary expression: 

template<class Left, class Right, class Oper> 
struct BinClos { 

const Leftfe argl; 

const Rightfe arg2; 

BinClos (const Leftfe a, const Rightfe b) :argl(a) ,arg2(b) {} //constructor 

}; 

Now we have to declare the class for a vector, which can be implemented later. Further, 
the addition operator should be described as a data type. The simple structure includes 
only the function apply to realize the addition: 

class Vector; // some vector class 

struct add { // encapsulates the '+' operation 

static double apply(double a, double b) { 
return a+b; 

} 

}; 

For the whole minimal implementation we need a C-|--|- operator, that yields the structure 
BinClos. There are some possibilities to define this operator. The following example 
represents an addition of any object of type Left with a vector: 

template<class Left> 

BinClos<Left , Vector ,add> operator + (Left& a, Vectorfe b) { 
return BinClos<Left, Vector, add>(a,b) ; 

} 
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So, the right hand side of the equation ^ is gathered in the compile time to the single 
template structure: 

BiiiClos<BinClos<Vector , Vector , add> , Vector , add> (3) 

The next step in the solution of is to assign the last complicated template structure 
(jSJ to the resulting vector X. We consider, at first, the usual approach that exists in 
the optimized C++ libraries, such as valarray and Blitz++ 4 . It uses the operator 
overloading to assign the whole expression in only one loop per component ((IJ). 

Xi = Ai + B, + a. (4) 

The structure BinClos is supplemented with an operator [] , that adds i-th components 
of two data members argl and arg2 of the structure. If the argl (or arg2) is a simple 
vector, than the operator [] is called in the class Vector, else the same operator is called 
recursively in the structure BinClos. That process is started by operator=, which is 
represented by the single loop and calls operator [] due to the expr [i] : 

template<class Left, class Right, class Oper> 
struct BinClos { 

const Leftfe argl; 

const Rightfe arg2; 

BinClos (const Left& a, const Rightfe b) :argl(a) ,arg2(b) {} 
double operator [] (int i) { 

return Oper : : apply (argl [i] , arg2 [i] ) ; 

} 

>; 

class Vector 

double* data; 

int size; 
public : 

//definition of constructors 
teniplate<class Left, class Right, class Oper> 
void operator= (const BinClos<Left , Right , Oper>& expr) { 
for (int i=0; Ksize; ++i) data[i] =expr [i] ; 

} 

double operator [] (int i) { return data[i] ; ]■ 

>; 

As it can be seen, the solution is distributed to some overloaded operators. Therefore, 
there is no possibility to substitute an external optimized subprogram. The next approach 
provides another template technique to realize the last step of the solution. 

Recursive function templates 

The basic idea of the following approach is to divide the right hand side of @ into simple 
units, which are bounded by addition or subtraction and than to apply the operations 
consequently to the vector X (fig. We have always two types of arguments and a 
type of operation at the top level of recursive built structure Q. These two types can be 
applied to the X with the type of the operation. If an argument is complicated, than the 
process continues in the same way recursively, else we get a simple addition or subtraction 
of a vector X with another vector. The last can be done in a single function without any 
overloading of operators. In the next step we have to construct the recursion using such 
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X -Mbih^C) or X =(A+B;+Cj 



Figure 1: Basic idea of the recursive solution 

functions. The optimal decision in terms of maintaining the efficiency is to apply recursion 
with respect to the template parameters of the functions. In order to minimize the number 
of the function parameters the functions are implemented as member functions of the 
structure BinClos. 

We need first to implement a C++ trait jlj to represent the rule of addition. In the 
terms of the set theory, we have a mapping: 

{+,+}-{+}, {+,-}-{-}, {-,+}-{-}, {-,-}-{+}• (5) 

Both addition and subtraction operations are now implemented as two empty structures 
and are used as template parameters for the recursive functions. They have only a meaning 
of different data types. 

struct add {}; 
struct sub {}; 

The trait add_rule receives two such empty structures as template parameters and 
produces the result of the data type oper. Two members of the mapping © are described 
by the next template instantiation: 

template<class Dpi, class 0p2> 
struct add_rule { 
typedef 0p2 oper; 

}; 

and the other two are specialized: 
templateo 

struct add_rule<sub,add> { 
typedef sub oper; 

}; 

templateo 

struct add_rule<sub,sub> { 
typedef add oper; 

}; 

We have now all tools to implement recursive function templates. These are functions 
Assign and Operation (either addition or subtraction). The first function Assign assigns 
the first argument argl due to the recursive call of itself (see below). If the argument is 
simple, e.g. vector, than the same function has to be implemented in the corresponding 
class (e.g. class Vector). The function Assign has one template parameter, which defines 
the return type. Thus, it is generalized for any operation. 

The second argument arg2 of the binary closure have to be added or subtracted from 
X. For this purpose, the function Operation is called. It receives also the second template 
parameter Lef tOp to recognize the operation. The second function Operation is recursive 
too and used especially for C++ like expressions, e.g. X+ = A — B. In this case the 
first argument A is added (first line of Operation implementation) and the second is 
subtracted. The last subtraction is obtained by compiler from two symbols {+, — } as 
data types with the help of the addition rule trait. 
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The function recursion must be finished. It means that two described template func- 
tions must be speciahzed as the member functions in each of the class (e.g. class Vector) 
that occurs in expressions resolved by this method. 

template<class Left, class Right, class Oper> 
struct BinClos { 

const Leftfe argl; 

const Rightfe arg2; 

BinClos (const Leftfe a, const Rightfe b) :argl(a) ,arg2(b) {} 

template<class Ret> 
void Assign(Ret x) { 

argl . Assign<Ret>(x) ; 

arg2 . Operation<0per , Ret> (x) ; 

} 

template<class LeftOp, class Ret> 
void Operation(Ret x) { 

argl .Operation<LeftOp,Ret>(x) ; 

arg2 . Operation<typename add_rule<Lef tOp , Oper> : : oper , Ret> (x) ; 

} 

}; 

The specialization in the class Vector is very simple using the BLAS library. The 
recursion process is initiated also in the class Vector from an overloaded operator, e.g. 
operator=, by corresponding call of the template function (e.g. Assign): 

class Vector 

double* data; 

int size; 
public : 

//definition of constructors 
template<class Left, class Right, class Oper> 
void operator= (const BinClos<Left, Right, Oper>& expr) { 
expr. Assign (data) ; 

} 

template<class LeftOp, class Ret> 
void Operation (Ret x) { 

cblas_daxpy(size , 1 . ,data, 1 ,x, 1) ; //sample specialization using CBLAS 

} 

>; 

The provided template functions can be specialized in the structure BinClos for some 
often used short algebraic expressions. It also leads to the increase in computational 
efficiency. For instance: 

templateO templateO inline void 

BinClos<Vector, Vector, add> : •.Assign<double*> (double* x) { // X=A+B 

} 

templateO templateO inline void 

BinClos<Vector, double, mul>: :Operation<add,double*>(double* x) { // X+=c*A 
} 
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Moreover, multiplication operations, such as vector-constant and vector-matrix multipli- 
cations, must necessarily be specialized, because they cannot be partially applied to the 
vector X. 

In a whole library for linear algebra we need to implement besides the binary ex- 
pression closure in the same way a unary expression closure and a unary and a binary 
function closure. It is also important to consider that the described method of recursive 
functions does not work if any unspecialized expression in some mathematical function is 
substituted, for instance: 

X = sin{A + B + C), where X, A, 5, C are vectors. (6) 

The acceptable solution allows the library to yield a temporary vector T. This vector 
receives the expression value in the mathematical function {T = A + B + C) and than is 
plugged into the function {X = sin{T)). The copying of a complicated argument result to 
the temporary vector does not reduce the computational performance significantly. This 
can be evidently proved by the following performance tests. 

Performance tests 

Three performance tests were developed to verify the functionality of the library and to 
find its weak points. 

• Test 1: Short expressions. A vector-matrix product and a vector-constant product 
are repeated 1000 times. That combination is often used in many mathematical and 
engineering computations. 

X = Ay, y = y + cx, c = const 

• Test 2: Long expressions. The sum of 7 vector-constant products are calculated 
100000 times. The test can show how slower is the evaluation using some loops 
(with the help of BLAS) and a single loop (due to overloaded operator [] ). 

y = y + ciui + ... + C7M7, y = cgy, Ci = const 

• Test 3: Long function expressions. The sum of 3 mathematical functions are calcu- 
lated 50000 times. It has the same aim as the second test. The temporary vectors 
in the computation of mathematical functions and their penalty are tested. 

y = y -\- log(ui) - cos{c2U2 + U3) + sin{c4U4 + C5U5 - uq), y = CQy, Ci = const 

The tests were performed on two completely different hardware platforms, that have their 
own specific optimized BLAS version: 

• Intel Pentium III 800MHz with Linux RedHat 7.3. 

Compiler: GNU C++, gcc 2.96 version. BLAS: Intel Math Kernel Library. 
The first three bars of the test diagram (fig. show the CPU time of the pure C++ 
implementation. It proves that the new implementation (third bar) has a perfor- 
mance similar to optimized libraries valarray and Blitz++. The implementation 
using BLAS is quicker in the most cases (test 1 and test 3), especially using hardware 
specific BLAS version Intel MKL. The last two bars in the tests 1 and 2 estimate 
the abstraction penalty, that is acceptable. 

• IBM M80 enterprise server having 8 Power3 500MIIz processors and IBM AIX op- 
erating system. (Each test uses one processor). 

Compiler: GNU C++, gcc 2.95 version. BLAS: IBM ESSL library. 
The both tests (fig. ^ do not show any abstraction penalty of the new library. The 
optimized BLAS (ESSL) library requires 2-3 times less computational time compared 
to the pure C++ implementation to the same task. 
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Test 2: long vector-constant expressions 



□ Standard C++ valarray library 

■ Blitz++ library 

□ C++ implementation 

□ Implementation using Intel MKL 

■ Standard RedHat Linux BLAS 

□ Pure CBLAS (without C++ abstraction) 



Figure 2: Performance tests on the Intel PIII-800 platform 
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Test 2: long vector-constant expressions 

IBM ESSL library □ Pure CBLAS (ESSL) 



□ C++ implementation 

Figure 3: Performance tests on the IBM M80 with Power 3 500MIIz processors 
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The same compiler with fohowing optimization options was used on the both hardware 
platforms: 

-07 -ffast-math -f unroll-loops -f omit-f rame-pointer -f expensive-optimizations 

Conclusions 

The new library for the linear algebra was developed, which uses the expression templates 
technique and the optimized BLAS to achieve higher computational performance than 
the known C++ libraries. According to the BLAS specification the library provides 
vectors and matrices of only single and double precision types. Any other types are not 
allowed, although, the algebraic expressions resolved by the library are not limited and 
implemented traditionally for both general and sparse vectors and matrices. 

The main trick of the implementation is the recursion of function templates realized by 
factitious empty classes add and sub. They have the aim to separate definition of structure 
BinClos and member function templates for addition and subtraction operations in the 
compilation time. Some old compilers do not allow to write function template if one of 
the template parameters is not a data type of an argument. For some C++ developer it 
can seem to be a mistake. But it works efficiently in the described implementation, since 
the template parameters are resolved in the compilation time without loss of the code 
performance. This code can be produced by some commonly used compilers: 

• GNU C++, gcc 2.95x, gcc 2.96x, gcc 3.x versions 

• Intel C++ compiler, version 5.0 

• KAI C++ compiler 
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